Skip to content

feat(npi): Implement agentic approach to GCSFuse NPI#166

Merged
kislaykishore merged 42 commits into
mainfrom
npi-validation-refactor
Jun 20, 2026
Merged

feat(npi): Implement agentic approach to GCSFuse NPI#166
kislaykishore merged 42 commits into
mainfrom
npi-validation-refactor

Conversation

@kislaykishore

@kislaykishore kislaykishore commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

This pull request includes several robustness improvements to the GCSFuse NPI validation and benchmarking workflow:

  • Implement an agent for running GCSFuse NPI.
  • Automatically resolve Go version from the GCSFuse repo's go.mod during Docker image builds to ensure version alignment.
  • Update runner agent instructions to enforce user targets inputs and check for active/suspended runs before starting.
  • Implement stall monitoring instructions and safety timeouts for both conformance tests and performance benchmarks (reducing orchestrator logging inactivity timeout to 5 mins).
  • Add detailed failure analysis guidelines for conformance testing.
  • Separate GCE VM and GKE Node performance runs into distinct sections and warnings in analysis and remediation guides, discouraging incorrect direct comparisons.
  • Explicitly treat remediation planning as advisory-only without performing automatic mutations.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces modular agent skills and workflow definitions for GCSFuse NPI validation, along with a verification script to validate deliverables. Key code changes include dynamic Go version resolution in image builds, billing project configuration in GKE jobs, and an increased inactivity timeout in the orchestrator. The review feedback highlights several critical improvements: resolving a hardcoded local directory path and a fragile test count assertion in the verification script, preventing a potential AttributeError in the orchestrator when parsing GCS bucket metadata, and reverting environment-specific configurations in targets.json to generic placeholders.

Comment thread npi/verify_agent_workflow.py
Comment thread npi/npi_orchestrator.py Outdated
Comment thread npi/verify_agent_workflow.py Outdated
Comment thread npi/targets.json Outdated
Comment thread npi/targets.json Outdated
@kislaykishore

Copy link
Copy Markdown
Collaborator Author

/Gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modularizes the GCSFuse NPI validation and benchmarking skills, dynamically resolves the Go version from go.mod during image builds, adds billing project support to GKE jobs, increases the orchestrator inactivity timeout to 4 hours, and introduces a fallback to RAM-based memory volumes in the RAID setup script. It also adds a workflow verification script. The review feedback highlights critical fixes needed in the verification script, including resolving file path and header mismatches, lowering the minimum test threshold to prevent false failures, aligning the timeout documentation with the code, and safely handling potential null values in the orchestrator's bucket metadata parsing.

Comment thread npi/verify_agent_workflow.py Outdated
Comment thread npi/verify_agent_workflow.py
Comment thread npi/verify_agent_workflow.py Outdated
Comment thread npi/.gemini/skills/benchmark-suite-execution/SKILL.md Outdated
Comment thread npi/npi_orchestrator.py Outdated

@kislaykishore kislaykishore left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All code review feedback has been successfully resolved:

  1. Updated the inactivity timeout documentation in benchmark-suite-execution/SKILL.md to match the code's 4-hour (14400s) timeout.
  2. Added safe dictionary navigation (meta.get("customPlacementConfig") or {}) in npi_orchestrator.py to prevent potential AttributeErrors when GCS bucket customPlacementConfig returns None.

…s, report headers, and wildcard conformance result files

@kislaykishore kislaykishore left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All remaining code review comments on the verification script have been successfully resolved:

  1. Wildcard Conformance JSON Support: Added a glob search fallback in verify_agent_workflow.py to dynamically search for and verify any target-specific conformance JSON results (e.g. conformance_results_*.json) if the default file is not found.
  2. Report Headers Alignment: Aligned the expected headers in the verification script with the actual headers generated in the report (matching the ## Target Performance Results section).
  3. Realistic Test Threshold: Lowered the minimum test count threshold in the verification script to a more realistic limit (total_tests < 100) to prevent false pipeline blocks.

@kislaykishore

Copy link
Copy Markdown
Collaborator Author

/gemini review

@kislaykishore

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modularizes the GCSFuse NPI validation and benchmarking workflow into distinct agent skills, including conformance testing, SSH management, benchmark setup, execution, analysis, and remediation. Key enhancements include a programmatic verification script, a RAM-based tmpfs fallback for hosts without local SSDs, dynamic Go version resolution from go.mod with input validation, and support for billing project annotations in GKE jobs. The review feedback suggests a safer prefix-based check in npi_gke.py to avoid false positives when detecting the billing-project mount option.

Comment thread npi/npi_gke.py Outdated
@kislaykishore

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modularizes the GCSFuse NPI validation and benchmarking workflow by splitting the monolithic run-gcsfuse-npi skill into several agent-specific modular skills. It also introduces dynamic Go version resolution from go.mod in build_images.py, adds input sanitization, supports RAM buffer fallback (tmpfs) in raid0-script.sh for hosts without local SSDs, integrates billing project options in GKE job specs, and adds a programmatic workflow verification script. Feedback on the changes includes addressing a potential bug in npi_gke.py where checking for an existing billing-project option on an unsplit string can fail if multiple comma-separated options are present, and improving verify_agent_workflow.py to ensure target-specific conformance results are not skipped if a stale conformance_results.json file exists.

Comment thread npi/npi_gke.py
Comment thread npi/verify_agent_workflow.py Outdated
… both default and target-specific conformance JSONs
@kislaykishore

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modularizes the GCSFuse NPI validation and benchmarking workflow into distinct, agent-specific skills (SSH management, conformance testing, benchmark setup, execution, analysis, and remediation) and introduces several robust enhancements. Key updates include dynamically resolving the Go version from the GCSFuse go.mod file, automatically injecting the GCP billing project into GKE CSI mount options, adding a 500GB tmpfs RAM disk fallback in raid0-script.sh for high-RAM hosts, and introducing a programmatic verification script for validation deliverables. Feedback on the changes suggests improving the robustness of extra_flag parsing in npi_gke.py by normalizing spaces around the '=' sign to prevent duplicate billing-project options.

Comment thread npi/npi_gke.py
@kislaykishore

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modularizes the GCSFuse NPI validation and benchmarking documentation into several agent-specific skills, such as conformance testing, SSH connection management, benchmark build/setup, suite execution, analysis, and remediation. Additionally, it updates the Python orchestration scripts to dynamically resolve Go versions from go.mod, handle custom placement configurations for RAPID buckets, support RAM tmpfs fallbacks when local SSDs are missing, and introduce a verification script for deliverables. The review feedback focuses on improving code quality by cleaning up redundant local imports of 're' and 'glob' and moving them to the top of the files in accordance with PEP 8 guidelines.

Comment thread npi/build_images.py Outdated
Comment thread npi/build_images.py Outdated
Comment thread npi/verify_agent_workflow.py
Comment thread npi/verify_agent_workflow.py Outdated
@kislaykishore

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modularizes the GCSFuse NPI validation and benchmarking workflow by splitting the monolithic run-gcsfuse-npi skill into several focused, agent-specific skill files (conformance testing, SSH connection management, benchmark build/setup, benchmark suite execution, analysis/report generation, and remediation advisor). Additionally, it introduces a programmatic verification script (verify_agent_workflow.py), adds a RAM-based tmpfs fallback in raid0-script.sh for hosts without local SSDs, dynamically resolves the Go version from go.mod in build_images.py with strict input sanitization, and enhances Kubernetes job specification generation in npi_gke.py to support billing project configurations. No review comments were provided, so there is no feedback to address.

@kislaykishore kislaykishore force-pushed the npi-validation-refactor branch from c85935e to 79905cd Compare June 20, 2026 07:20
…sting RAID0 mounts, clarify memory buffer fallback on TPU GCE VMs, and add .gitignore for ephemeral .agents workspace
@kislaykishore kislaykishore force-pushed the npi-validation-refactor branch from 79905cd to 2edfc2f Compare June 20, 2026 07:21
@kislaykishore kislaykishore force-pushed the npi-validation-refactor branch 2 times, most recently from f8420a8 to 28a950b Compare June 20, 2026 14:52
@kislaykishore kislaykishore force-pushed the npi-validation-refactor branch from 28a950b to 2904bdc Compare June 20, 2026 15:00
@kislaykishore kislaykishore changed the title feat(npi): conformance stall monitoring, version resolution, and independent evaluation guidelines feat(npi): Implement agentic approach to GCSFuse NPI Jun 20, 2026
@kislaykishore kislaykishore merged commit 1bd8b85 into main Jun 20, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant